OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Bibliographic component extraction from references based on a text recognition error model

Internal identifier: 001335 (Main/Exploration); previous: 001334; next: 001336

Authors: Atsuhiro Takasu [Japan]; Kenro Aihara [Japan]

Source: Systems and Computers in Japan, 36(7): 13–22, 2005

RBID : ISTEX:13F462FB44D7BB031DA1EE909AF927718C78AF9A

English descriptors: bibliographic matching; digital library; hidden Markov model; string matching

Abstract

Citation linkage is important for information retrieval and navigation in digital libraries. This paper proposes a novel method to extract bibliographic components from reference strings obtained from scanned document images. The proposed method uses an extended hidden Markov model representing both OCR error patterns and the syntactical structure of references simultaneously. Rule‐based systems usually have difficulty in obtaining rules. The proposed method gives a solution to this problem by parameter estimation of the model from training data. We applied the proposed method to references obtained by OCR with a large bibliographic database, and achieved about 90% extraction accuracy. Furthermore, the error analysis shows that the proposed method can find the approximate position of bibliographic components in the reference with very high accuracy. © 2005 Wiley Periodicals, Inc. Syst Comp Jpn, 36(7): 13–22, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20323
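To make the idea of labelling reference tokens with a hidden Markov model more concrete, here is a minimal sketch of ordinary Viterbi decoding over a tiny hand-set HMM. It is not the authors' extended model: the three states, every probability, the `feature` function and the `DIGIT_LIKE` set of OCR-confusable characters are assumptions invented for illustration, whereas the paper estimates its parameters from training data.

```python
import math

STATES = ["AUTHOR", "TITLE", "YEAR"]

# All probabilities below are invented for illustration only.
START = {"AUTHOR": 0.8, "TITLE": 0.15, "YEAR": 0.05}
TRANS = {
    "AUTHOR": {"AUTHOR": 0.6, "TITLE": 0.35, "YEAR": 0.05},
    "TITLE": {"AUTHOR": 0.05, "TITLE": 0.7, "YEAR": 0.25},
    "YEAR": {"AUTHOR": 0.1, "TITLE": 0.2, "YEAR": 0.7},
}
EMIT = {
    "AUTHOR": {"cap": 0.8, "low": 0.15, "num": 0.05},
    "TITLE": {"cap": 0.3, "low": 0.65, "num": 0.05},
    "YEAR": {"cap": 0.05, "low": 0.05, "num": 0.9},
}

# Characters treated as digit-like, including letters that OCR may confuse
# with digits (an assumption, not taken from the paper).
DIGIT_LIKE = set("0123456789OoIlSB")


def feature(token):
    """Map a token to a coarse observation class: 'num', 'cap' or 'low'."""
    core = token.strip(".,;:()")
    if any(c.isdigit() for c in core) and all(c in DIGIT_LIKE for c in core):
        return "num"
    if core[:1].isupper():
        return "cap"
    return "low"


def viterbi(tokens):
    """Return the most probable state sequence for a non-empty token list."""
    obs = [feature(t) for t in tokens]
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


tokens = "Takasu A. extraction of components 2OO5".split()
print(list(zip(tokens, viterbi(tokens))))
```

With these made-up parameters, the misrecognized year `2OO5` (letter O instead of zero) is still assigned to the `YEAR` state, because the observation class tolerates digit-like letters. A model in the spirit of the abstract would learn such confusions from data rather than hard-coding them.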

Url:
DOI: 10.1002/scj.20323


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Bibliographic component extraction from references based on a text recognition error model</title>
<author>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</author>
<author>
<name sortKey="Aihara, Kenro" sort="Aihara, Kenro" uniqKey="Aihara K" first="Kenro" last="Aihara">Kenro Aihara</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:13F462FB44D7BB031DA1EE909AF927718C78AF9A</idno>
<date when="2005" year="2005">2005</date>
<idno type="doi">10.1002/scj.20323</idno>
<idno type="url">https://api.istex.fr/document/13F462FB44D7BB031DA1EE909AF927718C78AF9A/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000044</idno>
<idno type="wicri:Area/Istex/Curation">000043</idno>
<idno type="wicri:Area/Istex/Checkpoint">000C44</idno>
<idno type="wicri:doubleKey">0882-1666:2005:Takasu A:bibliographic:component:extraction</idno>
<idno type="wicri:Area/Main/Merge">001371</idno>
<idno type="wicri:Area/Main/Curation">001335</idno>
<idno type="wicri:Area/Main/Exploration">001335</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Bibliographic component extraction from references based on a text recognition error model</title>
<author>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
<affiliation wicri:level="3">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Research Center for Testbeds and Prototyping, National Institute of Informatics, Tokyo</wicri:regionArea>
<placeName>
<settlement type="city">Tokyo</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Aihara, Kenro" sort="Aihara, Kenro" uniqKey="Aihara K" first="Kenro" last="Aihara">Kenro Aihara</name>
<affiliation wicri:level="3">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Software Research Division, National Institute of Informatics, Tokyo</wicri:regionArea>
<placeName>
<settlement type="city">Tokyo</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Systems and Computers in Japan</title>
<title level="j" type="abbrev">Syst. Comp. Jpn.</title>
<idno type="ISSN">0882-1666</idno>
<idno type="eISSN">1520-684X</idno>
<imprint>
<publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>Hoboken</pubPlace>
<date type="published" when="2005-06-30">2005-06-30</date>
<biblScope unit="volume">36</biblScope>
<biblScope unit="issue">7</biblScope>
<biblScope unit="page" from="13">13</biblScope>
<biblScope unit="page" to="22">22</biblScope>
</imprint>
<idno type="ISSN">0882-1666</idno>
</series>
<idno type="istex">13F462FB44D7BB031DA1EE909AF927718C78AF9A</idno>
<idno type="DOI">10.1002/scj.20323</idno>
<idno type="ArticleID">SCJ20323</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0882-1666</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>bibliographic matching</term>
<term>digital library</term>
<term>hidden Markov model</term>
<term>string matching</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Citation linkage is important for information retrieval and navigation in digital libraries. This paper proposes a novel method to extract bibliographic components from reference strings obtained from scanned document images. The proposed method uses an extended hidden Markov model representing both OCR error patterns and the syntactical structure of references simultaneously. Rule‐based systems usually have difficulty in obtaining rules. The proposed method gives a solution to this problem by parameter estimation of the model from training data. We applied the proposed method to references obtained by OCR with a large bibliographic database, and achieved about 90% extraction accuracy. Furthermore, the error analysis shows that the proposed method can find the approximate position of bibliographic components in the reference with very high accuracy. © 2005 Wiley Periodicals, Inc. Syst Comp Jpn, 36(7): 13–22, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20323</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
</country>
<settlement>
<li>Tokyo</li>
</settlement>
</list>
<tree>
<country name="Japon">
<noRegion>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</noRegion>
<name sortKey="Aihara, Kenro" sort="Aihara, Kenro" uniqKey="Aihara K" first="Kenro" last="Aihara">Kenro Aihara</name>
</country>
</tree>
</affiliations>
</record>
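
As a hedged illustration of how a record like the one above might be read programmatically, the sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the record. The record as displayed uses the `wicri:` prefix without declaring it, so a strict parser rejects it; the sketch works around this by binding the prefix to a placeholder URI. The names `WICRI_NS`, `SAMPLE` and `parse_record` are hypothetical and not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption): the record declares none.
WICRI_NS = "urn:example:wicri"

# Trimmed copy of the record above, kept short for the example.
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Bibliographic component extraction from references based on a text recognition error model</title>
<author>
<name sortKey="Takasu, Atsuhiro" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</author>
<author>
<name sortKey="Aihara, Kenro" first="Kenro" last="Aihara">Kenro Aihara</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="doi">10.1002/scj.20323</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""


def parse_record(text):
    """Return title, authors and DOI from a Wicri-style <record> string."""
    # Bind the undeclared "wicri:" prefix so the parser accepts the text.
    patched = text.replace("<record>", '<record xmlns:wicri="%s">' % WICRI_NS, 1)
    root = ET.fromstring(patched)
    title_stmt = root.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "doi": root.findtext(".//idno[@type='doi']"),
    }


info = parse_record(SAMPLE)
print(info["authors"], info["doi"])
```

The lookup matches the lowercase `type="doi"` identifier from `publicationStmt`; the record also carries a `type="DOI"` identifier inside `biblStruct`, which this predicate does not match because attribute values are compared case-sensitively.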

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001335 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001335 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:13F462FB44D7BB031DA1EE909AF927718C78AF9A
   |texte=   Bibliographic component extraction from references based on a text recognition error model
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024